suppressWarnings({
  library(IsoMatchMS)
  library(pspecterlib)
  library(ggplot2)
  library(xlsx)
})

1 What is IsoMatchMS?

IsoMatchMS was designed to match biomolecule isotope profiles to spectra, allowing for the characterization of high-quality biomolecule annotations. Best isotope matches are visualized in a trelliscope display, which allows for easy sorting of isotope matches by score.

Alt text
Alt text

Much of the backend functionality is drawn from the pspecterlib package, including generating molecular formula objects. More information about the backend package can be found here.

2 Wrapper Function - run_isomatchms()

We encourage new users to run IsoMatchMS with a single wrapper function, run_isomatchms(). The function requires, at a minimum, a vector containing either ProForma strings or molecular formulas of the biomolecules to match, a Settings File specifying the runs parameters in the form of an .xlsx, and finally a pspecterlib peak_data object.

More examples can be found below. Example datasets can be found in the ProteoMatch/inst/extdata directory.

2.1 Intact Proteins Example

The following shows an example use of the run_isomatchms() function containing intact protein test data (also known as top-down proteomics). These proteins are “intact” and have not been digested into peptides.

Below, the .csv containing protein information is loaded in, along with the pre-made pspecterlib peak_data object. Further instructions on how to make this can be found in Section 2.3.

Notice that in most cases, the most abundant isotope is not the monoisotopic mass (indicated by the blue dashed line).

# Read in the protein test data
protein_data <- read.csv(system.file("extdata", "Intact_Proteins_List_Short.csv", package = "IsoMatchMS"))
intact_peak_data <- readRDS(system.file("extdata", "Intact_PeakData.RDS", package = "IsoMatchMS"))

# Run the IsoMatchMS wrapper function
run_isomatchms(
    Biomolecules = protein_data$Proteoform,
    BioType = "ProForma",
    SummedSpectra = intact_peak_data,
    SettingsFile = system.file("extdata", "Intact_Protein_Defaults.xlsx", package = "IsoMatchMS"),
    Path = "~/Downloads/Small_Intact_Test",
    Identifiers = protein_data$Protein.accession
)

2.2 Peptides Example

The following shows an example use of the run_isomatchms() function with the peptides test data (also known as bottom-up proteomics). In other words, the proteins have been digested into peptides. The most abundant isotope tends to be the monoisotopic mass.

# Read in the peptide test data 
peptide_data <- read.csv(system.file("extdata", "Peptides_List_Short.csv", package = "IsoMatchMS"))
digested_peak_data <- readRDS(system.file("extdata", "Peptides_PeakData.RDS", package = "IsoMatchMS"))

# Run the IsoMatchMS wrapper function
run_isomatchms(
    Biomolecules = peptide_data$Proteoform,
    BioType = "ProForma",
    SummedSpectra = digested_peak_data,
    SettingsFile = system.file("extdata", "Peptides_Defaults.xlsx", package = "IsoMatchMS"),
    Path = "~/Downloads/Small_Peptides_Test",
    Identifiers = peptide_data$Protein
)

2.3 Glycan Example

Similar to the peptide example, the most abundant isotope tends to be the monoisotopic mass.

# Read in the glycan test data 
glycans_data <- read.xlsx(system.file("extdata", "Glycans_List_Short.xlsx", package = "IsoMatchMS"), 1)
glycans_peak_data <- readRDS(system.file("extdata", "Glycans_PeakData.RDS", package = "IsoMatchMS"))

# Run the IsoMatchMS wrapper function
run_isomatchms(
    Biomolecules = glycans_data$formula,
    BioType = "Molecular Formula",
    SummedSpectra = glycans_peak_data,
    SettingsFile = system.file("extdata", "Glycans_Defaults.xlsx", package = "IsoMatchMS"),
    Path = "~/Downloads/Small_Glycans_Test",
    Identifiers = glycans_data$name
)

3 Accessory Functions

The IsoMatchMS package offers several stand-alone functions, including the ability to extract ProForma strings from mzid files pull_modifications_from_mzid(), converting several protein annotations to ProForma with create_proforma(), and summing ms1 spectra with sum_ms1_spectra(). All other capabilities support the main IsoMatchMS function. These accessory functions were built to assist users in formatting their data appropriately to run the main IsoMatchMS pipeline.

3.1 Generating ProForma Strings

IsoMatchMS requires either molecular formulas or the ProForma string format, which is the default format from TopPIC. However, if the data contains output from MSPathFinder, ProSight, or pTop instead, the create_proforma() function can be used to convert various peptide/protein/proteoform annotations to the ProForma Format. IsoMatchMS can also derive ProForma strings from an .mzid file by using the pull_modifications_from_mzid function.

3.1.1 MSPathFinder, ProSight, and pTop

The following examples show how to use the create_proforma function on data from the MSPathFinder, ProSight, and pTop.

# MSPathFinder example with scan and protein
create_proforma(
   Sequence = c("GRGKTGGKARAKAKSRSSRAGLQFPVGRVHRL", "KKTRIIPRHLQLAIRNDEELNKLLGGVTIAY", "TEST"),
   Modifications = c("Methyl 8,Phospho 12,Phospho 22,DiMethyl 30", "TriMethyl 6,Phospho 15", ""),
   Tool = "MSPathFinder",
   Scan = c(3334, 3336, 3338),
   Protein = c("Protein33", "Protein45", "Protein47")
)
#>   Scan                                                           Proteoform
#> 1 3334 GRGKTGGK[Methyl]ARAK[Phospho]AKSRSSRAGL[Phospho]QFPVGRVH[DiMethyl]RL
#> 2 3336                  KKTRII[TriMethyl]PRHLQLAIR[Phospho]NDEELNKLLGGVTIAY
#> 3 3338                                                                 TEST
#>     Protein
#> 1 Protein33
#> 2 Protein45
#> 3 Protein47
# ProSight example without scan and protein
create_proforma(
   Sequence = c("(49)M(37)SGRGKQG", "SG(67)RGKQGGKARAKAKSRSSRAG", "TEST"),
   Modifications = c("N-acetyl-L-methionine (49), O-phospho-L-serine (37)", "omega-N,omega-N'-dimethyl-L-arginine (67)", ""),
   Tool = "ProSight",
   ConversionList = list("N-acetyl-L-methionine" = "Acetyl", "O-phospho-L-serine" = "Phospho", 
                         "omega-N,omega-N'-dimethyl-L-arginine" = "DiMethyl")
)
#>   Scan                       Proteoform Protein
#> 1   NA        [Acetyl]M[Phospho]SGRGKQG      NA
#> 2   NA SG[DiMethyl]RGKQGGKARAKAKSRSSRAG      NA
#> 3   NA                             TEST      NA
# pTop Example
create_proforma(
   Sequence = c("SGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRL", "TEST", "PEPSRSTPAPKKGSKKAITKAQKKDGKKRKRGRKESYSIYV"),
   Modifications = c("(20)Dimethyl[K];(16)Acetyl[K];(0)Acetyl[AnyN-term];", "", "(20)Dimethyl[K];"),
   Tool = "pTop"
)
#>   Scan                                                      Proteoform Protein
#> 1   NA [Acetyl]SGRGKGGKGLGKGGAK[Acetyl]RHRK[Dimethyl]VLRDNIQGITKPAIRRL      NA
#> 2   NA                                                            TEST      NA
#> 3   NA             PEPSRSTPAPKKGSKKAITK[Dimethyl]AQKKDGKKRKRGRKESYSIYV      NA

3.1.2 .mzid file

Here, we download an example .mzid file, then use the pull_modifications_from_mzid function to create a dataframe of scan numbers, proforma strings, and protein names.

# Create a temporary directory 
tmpdir <- tempdir()

# Copy example data to temporary directory
file <- "https://raw.githubusercontent.com/EMSL-Computing/pspecter/master/pspecter_container/TestFiles/BottomUp/BottomUp.mzid"
download.file(file, file.path(tmpdir, tail(unlist(strsplit(file, "/")), 1)))

# Pull ProForma string from an mzid file 
mod_bu <- pull_modifications_from_mzid(file.path(tmpdir, "BottomUp.mzid"))

3.2 Summing MS1 Spectra

IsoMatchMS has the ability to sum the spectra contained in the .mzML file before running the main pipeline function with sum_ms1_spectra().

# Create a temporary directory and copy example data there
tmpdir <- tempdir()

# Download the test data
file <- "https://raw.githubusercontent.com/EMSL-Computing/pspecter/master/pspecter_container/TestFiles/BottomUp/BottomUp.mzML"
download.file(file, file.path(tmpdir, tail(unlist(strsplit(file, "/")), 1)))

# Summing the spectra
SummedSpec <- sum_ms1_spectra(mzMLPath = file.path(tmpdir, "BottomUp.mzML"), MinimumAbundance = 0.01)

4 Main Pipeline Functions

4.1 Settings File Parameters

The ProteoMatch package requires a “Settings” file in the format of on .xlsx. Some examples of this format are provided inst/extdata directory. The example files show the required parameters, suggestions for how often they will need to be changed between runs, and a description of the parameter. We’ll take a look at the example here.

SettingsFile <- xlsx::read.xlsx(system.file("extdata", "Intact_Protein_Defaults.xlsx", package = "IsoMatchMS"), 1)
knitr::kable(SettingsFile)
Parameter Default Update Description
MZRange 2500-20000 Everytime A range of MZ values to filter the data by. It is highly recommended that users visualize the spectra first to determine a reasonable cutoff range.
NoiseFilter 5 Everytime An abundance (every peak is scaled to the largest peak) cutoff for peaks. A reasonable value should be in the 2.5 - 10% range. If many peaks are matched to noise, increase this value.
Charges 1,2 Everytime The range of charges to test. List charges separated by a comma
AbundanceThreshold 50 Occasionally The +/- percent abundance an isotope peak can vary and still be considered a match. If 50%, and the calculated abundance is 3, the matched abundance can vary from 1.5-4
CorrelationMinimum 0.95 Occasionally The minimum correlation value to consider when generating the trelliscope display
PPMThreshold 10 Occasionally The maximum m/z error permitted. 
AdductLabels proton Occasionally Labels for the AdductMasses. Should be separated by a comma with no space (ex. proton,sodium)
AdductMasses 1.00727647 Occasionally Masses for the Adducts. Should be separated by a comma with no space (ex. 1.00727647,22.98977)
AddMAI FALSE Rarely Add most abundant isotope to the molecular formula calculation step. Warning: This will slow down the tool. 
IsotopeMinimum 5 Rarely The minimum number of isotopes to consider. We recommend 5 for intact proteomics, and 2 or 3 otherwise.
PlottingWindow 2 Rarely The -/+ m/z value on either side of the matched spectra plot. Default is 2 m/z.
IsotopingAlgorithm Rdisop Rarely Either “Rdisop” or “isopat”. “Rdisop” is more accurate and recommended, though may crash on windows OS. “isopat” may then be used as an alternative.

4.2 Creating a Peak_Data Object

Once the optional step of summing the spectra has been completed, the pspecterlib package can be used to create a peak_data object. This can be done with a .mzML file or with two vectors with MZ and Intensity values.

This is an example of reading in a summed mzML file. First, a scan_metadata object is created using the pspecterlib::get_scan_metadata function. That output is then used in the pspecterlib::get_peak_data function to create the peak_data object

# Extract the scan metadata
scan_data <- pspecterlib::get_scan_metadata(
  MSPath = system.file("extdata", "Intact_Protein_Summed_MS1.mzML", package = "IsoMatchMS")
)

# Create the peak_data object
peak_data <- pspecterlib::get_peak_data(
  ScanMetadata = scan_data, 
  ScanNumber = 1, 
  MinAbundance = 0.1
)

head(peak_data)
#>         M/Z Intensity Abundance
#> 1: 2499.990  1.298022    0.1182
#> 2: 2500.007  2.871200    0.2615
#> 3: 2500.024  4.821795    0.4391
#> 4: 2500.041  6.556417    0.5970
#> 5: 2500.057  7.286436    0.6635
#> 6: 2500.074  7.886676    0.7182

In this example, MZ and Intensity data are read in from a .csv, and the vectors in the data.frame are fed into pspecterlib::make_peak_data in order to make the peak_data object.

#Reading in .csv
pd_df <- read.csv(system.file("extdata", "Peptides_PeakData.csv", package = "IsoMatchMS"))

#Making the peak_data object
peak_data <- pspecterlib::make_peak_data(
  MZ = pd_df$M.Z, 
  Intensity = pd_df$Intensity
)

head(peak_data)

4.3 Molecular Formulas

This step, as for the remainder of the functions outlined in the remainder of this vignette, are wrapped in the run_proteomatch function. Return to section 1 to see examples of running this function.

Even if molecular formulas are provided, the calculate_molform() function is required to create a IsoMatchMS_MolForm class object. Regardless of format, the IsoMatchMS_MolForm object will be a data.table with 9 columns: Biomolecules, Identifiers, Adduct Names, Adduct Masses, Charges, Molecular Formulas, Mass Shifts, Monoisotopic Masses, and Most Abundant Isotopes.

All of these values are calculated from a combination of the ProForma strings or molecular formulas, charges, and adducts. If ProForma sequences are provided, they are trimmed to values between the first and second period. Parenthesis are removed,
and values within square brackets are extracted as post-translation modifications (PTMs). Molecular formulas, mass shifts, adducts, and charges are all tracked in this dataframe.

Here an example is shown using peptides dataset, which contains Profroma strings.

# Run two examples with two charge states
MolForm <- calculate_molform(
   Biomolecules = c("M.SS[Methyl]S.V", "M.S[Methyl]S[22]S[23].V"),
   BioType = "ProForma",
   Charge = 1:2
)

MolForm %>% knitr::kable()
Biomolecules Identifiers Adduct Name Adduct Mass Charge Molecular Formula Mass Shift Monoisotopic Mass Most Abundant Isotope
M.SS[Methyl]S.V NA proton 1.007276 1 H19C10N3O7 0 294.1296 NA
M.SS[Methyl]S.V NA proton 1.007276 2 H19C10N3O7 0 147.5684 NA
M.S[Methyl]S[22]S[23].V NA proton 1.007276 1 H19C10N3O7 45 339.1296 NA
M.S[Methyl]S[22]S[23].V NA proton 1.007276 2 H19C10N3O7 45 170.0684 NA

To access the current modifications database, use:

# Load backend glossary
Glossary <- data.table::fread(
  system.file("extdata", "Unimod_v20220602.csv", package = "pspecterlib")
)

Glossary %>% 
  head() %>% 
  dplyr::select(Modification, `Mass Change`, Residues, H, C, O, N, S) %>% 
  knitr::kable()

Report new modifications to be added to the pspecterlib github issues page.

4.4 Filter Peaks

The filter_peaks() function allows users to focus their visualization on a particular range of M/Z values, as well as filter out noise present in their data, which will improve and speed up the identification process. Larger fragment data (like top-down proteomic data) should have a higher noise filter than small fragment data (bottom-up), since there are many lowly abundant peaks with high-intact data.

The abundance values are percentages of each peak’s height compared to the largest peak. If we set a noise filter at 5, if will remove any peaks with any abundance less than 5% of the highest intensity. As a general rule, if too many peaks are matched, try upping the noise filter, and if too little, try a smaller noise filter.

4.5 3. Match Proteoform to MS1

Now, we pass the peak_data object (does not need to be filtered) and the molecular formula data.table (ProteoMatch_MolForm) to the proteoform matching function. See ?match_biomolecule_to_ms1 for a more detailed explanation of the parameters.

# Run two examples with two charge states
MolForms_Test <- calculate_molform(
   Biomolecules = c("M.SS[Methyl]S.V", "M.SS[6]S[7].V"),
   BioType = "ProForma", 
   Identifiers = c("Test1", "Test2"),
   Charge = 1:2
)

# Generate some experimental peak data to match
PeakData <- pspecterlib::make_peak_data(
  MZ = c(147.5684, 148.0699, 148.5708, 149.0721, 149.5731,
         294.1296, 295.1325, 296.1343, 297.1369, 298.1390),
  Intensity = c(868.3680036, 110.9431876, 18.7179196, 1.7871629, 0.1701294,
                868.3680036, 110.9431876, 18.7179196, 1.7871629, 0.1701294)
)
# Run algorithm
Matches <- match_biomolecule_to_ms1(
   PeakData = PeakData,
   MolecularFormula = MolForms_Test,
   IsotopeMinimum = 2
)

Matches %>% knitr::kable()
Identifiers Adduct Mass Adduct Name M/Z Mass Shift Monoisotopic Mass Abundance Isotope M/Z Search Window M/Z Experimental Intensity Experimental Abundance Experimental PPM Error Absolute Relative Error Pearson Correlation Charge Biomolecules Molecular Formula Most Abundant Isotope ID
Test1 1.007276 proton 147.5684 0 147.5684 100.000000 0 0.0014757 147.5684 868.36800 100.0000 -0.1796489 5.5e-06 1 2 M.SS[Methyl]S.V H19C10N3O7 NA 1
Test1 1.007276 proton 148.0699 0 147.5684 12.776057 1 0.0014807 148.0699 110.94319 12.7761 0.1827338 5.5e-06 1 2 M.SS[Methyl]S.V H19C10N3O7 NA 1
Test1 1.007276 proton 148.5708 0 147.5684 2.155528 2 0.0014857 148.5708 18.71792 2.1555 -0.0688777 5.5e-06 1 2 M.SS[Methyl]S.V H19C10N3O7 NA 1
Test1 1.007276 proton 294.1296 0 294.1296 100.000000 0 0.0029413 294.1296 868.36800 100.0000 0.0797234 5.5e-06 1 1 M.SS[Methyl]S.V H19C10N3O7 NA 2
Test1 1.007276 proton 295.1325 0 294.1296 12.776057 1 0.0029513 295.1325 110.94319 12.7761 0.1036306 5.5e-06 1 1 M.SS[Methyl]S.V H19C10N3O7 NA 2
Test1 1.007276 proton 296.1343 0 294.1296 2.155528 2 0.0029613 296.1343 18.71792 2.1555 -0.1485691 5.5e-06 1 1 M.SS[Methyl]S.V H19C10N3O7 NA 2

Each match is generated with a unique ID for plotting purposes. We have three metrics of peak match quality: Absolute Relative Error, Cosine Correlation, and a Figure of Merit.

The equation for Absolute Relative Error is:

\[ \frac{1}{n}*\sum{\frac{|A_R - A_E|}{A_R}} \] where n is the number of peaks matched, \(A_R\) is the reference abundance, and \(A_E\) is the experimental abundance.

4.6 Plot Results

If there is only one match generated, the results can be easily visualized with plot_Ms1Match. There are many plotting options that can be explored with ?plot_Ms1Match

plot_Ms1Match(
  PeakData = PeakData,
  Ms1Match = Matches, 
  ID = 1 # Pull whatever match you're interested in plotting
)

If there are multiple peaks, users can build a trelliscope display with proteomatch_trelliscope.

isomatchms_trelliscope(
  PeakData = PeakData,
  Ms1Match = Matches,
  Path = "~/Downloads/TrelliTest"
)

5 Other Features

5.1 Adducts

In the calculate_molform() function, users may supply a named vector of adduct masses to test out different adducts. For example:

calculate_molform(
   Biomolecules = c("C6H12O6", "C2H4O1"),
   BioType = "Molecular Formula",
   Identifiers = c("Glucose", "Acetyl"),
   AdductMasses = c(proton = 1.00727637, sodium = 22.989769),
   AddMostAbundantIsotope = TRUE
) %>% knitr::kable()
Biomolecules Identifiers Adduct Name Adduct Mass Charge Molecular Formula Mass Shift Monoisotopic Mass Most Abundant Isotope
C6H12O6 Glucose proton 1.007276 1 C6H12O6 0 181.07067 181.07066
C6H12O6 Glucose sodium 22.989769 1 C6H12O6 0 203.05316 203.05316
C2H4O1 Acetyl proton 1.007276 1 C2H4O1 0 45.03349 45.03349
C2H4O1 Acetyl sodium 22.989769 1 C2H4O1 0 67.01598 67.01598